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Abstract 

Consider a 0-1 observation matrix M, where rows correspond to entities and columns correspond 
to signals; a value of 1 (or 0) in cell {i,j) of M indicates that signal j has been observed (or not 
observed) in entity i. Given such a matrix we study the problem of inferring the underlying directed 
links between entities (rows) and finding which entries in the matrix are initiators. 

We formally define this problem and propose an MCMC framework for estimating the links and 
the initiators given the matrix of observations M. We also show how this framework can be extended 
to incorporate a temporal aspect; instead of considering a single observation matrix M we consider 
a sequence of observation matrices Mi , . . . , Alt over time. 

We show the connection between our problem and several problems studied in the field of social- 
network analysis. We apply our method to paleontological and ecological data and show that our 
algorithms work well in practice and give reasonable results. 

1 Introduction 

Analyzing 0-1 matrices is one of the main themes in data mining. Techniques such as clustering or 
mixture modelling, matrix decomposition techniques such as PCA, ICA, and NMR, and Bayesian all 
aim to give an answer to the informal question: "Where does the matrix come from?" These approaches 
aim at describing a probabilistic generative model that describes the observed matrix well. 

In this paper we consider yet another way of answering the question "Where does a 0-1 matrix M 
come from?" In our model, the matrix M of size n x m is considered to arise from initiators, certain few 
entries that are initially 1. The initiators propagate their I's by following the links of a directed influence 
graph G (represented by an n x n adjacency matrix). We denote the initiator matrix of size rt x to by 
N and we use G (of size n x n) to refer both to the directed graph between the rows of M and as well 
as its adjacency matrix. Then, we believe that the structure of N and G can tell how a matrix M has 
been created. As an example, consider the following data matrix M : 

ri 1 1 1 
r2 1 1 1 1 1 1 
ra 1 1 1 

We can describe this matrix by considering the 3x6 initiator matrix TV and 3x3 adjacency matrix 

G: 
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The directed graph G tells that row ri influences row r2, and that also row influences row r2. 

The structure of G and N of course depends on the assumed propagation model. For the purposes 
of this paper we use a propagation model that is similar in spirit to the ones used in [8, 11, 15, 17]. 
Informally, matrices G and N specify the probability for each entry of M being 1, as follows: the 
positions that already have value 1 in will have value 1 in M (almost) certainly. Otherwise, a value of 
1 in entry N{a,j) propagates to entries M{b,j) with probability p, where p is proportional to and 
L is the length of shortest path from 6 to o in G. If there are multiple entries N{a,j) = 1 such that a 
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and b arc connected in G, the probabilities are combined in the natural way. Thus, given G and A'' the 
propagation model gives a probability of each entry in M being 1 or 0. From these probabilities we can 
compute the likelihood Pt{M\G,N). 

In our case though, G and N are unknown and the task is to estimate them given a matrix of 
observations M. As a data-analysis task this setting can seem to be quite underspecified: after all, we 
start from an n x m matrix M, and as output we obtain an n x m matrix N and n x n adjacency 
matrix G. Thus, there are more bits in the output than there are in the input. For example, we could 
always take N = M and let G be empty. We adopt a Bayesian approach to handle this issue. We prefer 
solutions where the number x of Is in A'' is small, so we put a prior for A'' that is proportional to for 
some constant c > 1; a similar prior on the number of edges in G implements our preference for graphs 
with a small number of edges. 

Our task is then the following: given an observed 0-1 matrix M, sample graph-initiator pairs (G, A'') 
from the posterior distribution 

Vv{G,N\M) oc Pr(G)Pr(A/')Pr(M|G, A/'). 

Prom such samples, we can then compute the average strength that an edge is present in G or that a 
particular entry is present in N . 

We show that this approach yields intuitive results on real ecological and paleontological data sets. 
While the method converges relatively slowly, it is still usable for data sets with hundreds of rows and 
columns. 

Related work: Here we give a brief summary of related work. The problem of inferring a network 
structure given a set of observations is not new; there has been lots of research for finding graphical 
models (see for example [4, 10] and references therein) and Bayesian networks in particular (see [5, 12] 
and references therein). Network discovery problems have also been studied a lot in biological applications 
where the focus is to to find metabolic networks or gene regulatory networks given a set of observations, 
see [14] for a 37-page bibliography on these themes. Although our problem is related to the problems 
studied in the above papers, this relationship remains at a fairly high level. For example, we are not 
aware of any graph reconstruction problem that assumes our information-propagation model. Moreover, 
the aspect of identifying the initiators as well as the underlying link structure that we bring here is new. 

To a certain extent the work of [1] and [16] is also related to ours. The authors in these papers try to 
identify influential individuals and initiators by looking at blog postings ([1]) and purchase data ([16]). 
However, the aspect of estimation of links between the individuals is missing there. 

Finally, our problem is related to the problem of identifying nodes in a social network to achieve 
maximum diffusion of information as studied in [8, 11] or the opposite problem of identifying nodes in 
a network to maximally contain the spread of an epidemic studied for example in [2, 7, 13]. The main 
difference between these problems and ours is that we do not assume any given network as part of the 
input. A more detailed discussion about the connection of our problem to the maximum information- 
propagation problems studied in the past is given in Section 6. 

Roadmap: The rest of this paper is organized as follows: in Section 2 we describe the propagation 
model we assume that given a graph determines the transition of 1-entries from the initiators to the 
receivers. In Section 3 we describe our algorithm for sampling pairs of graphs and initiators from the 
right posterior distribution and in Section 4 we show how to extend this algorithm in the presence of more 
than one instances of the observation matrices. Some indicative experiments on real data are given in 
Section 5. In Section 6 we discuss the problem using a more general viewpoint and point out connections 
to existing work. We conclude in Section 7. 

2 The propagation model 

Consider n x m 0-1 matrix M such that every row of the matrix corresponds to an entity and every 
column correspond to a signal. The entries of the matrix take values {0, 1} and M{i,j) = 1 {M{i,j) = 0) 
is interpreted signal j has been observed (not observed) at entity j. We call matrix M the observation 
matrix. Since our matrices contain only O's and I's, we also give to them a set interpretation. 

The underlying assumption is that the rows of M (the entities) are organized in an underlying 
(directed) graph G. The interactions of the nodes in this graph determine the formation of the final 
matrix M. Additionally, let N he annxm 0-1 matrix that gives information about the initiators of the 
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propagation. That is, if N{i,j) = 1, then the i-th entity has initiated the propagation of signal j. We 
call matrix N the initiators matrix. 

Assume a (known) directed network G and initiators matrix N. We consider the following information- 
propagation model: a signal u is observed at node i, that is not an initiator for u, if u is transferred to i 
from one of the initiators of u via a path in G. 

The longer the path from an initiator j to i the less influence node j has in the appearance of u in 
node i. We use function b(j, i, G) to denote the influence that node j has to node i in graph G. If we 
use daij, i) to denote the length of the shortest path from j to i in the graph G, then we have 

h{3,i,G) = a^^^^'^K (1) 

where a S [0, 1]. The intuition behind the form of Equation (1) is that the influence a node had on any 
other node in the graph decays exponentially with their graph distance. We call this propagation model 
the shortest path (SP) model. 

Next we define the likelihood of the observed data M given G and N. Assuming that all entries in 
M are independent we have that 

n m 

Pr(M|G,7V) = Y[Y[PiiM{i,u)\G,N). 

i=l u=l 

Each individual term Pi{M{i,u)\G, N) is easy to define. First recall that each entry M{i,u) can take 
values or 1. The case M{i,u) = occurs when no 1 in the u column of iV propagates to row i and 
N{i,u) = 0. That is, 

Pr(M(i,it) = 0\G,N) = 

= (1 - N{i, u)) N{j, u)b{j, I, G)) 

n 

For the case M{i,u) = 1 we have 

Pi{M{i,u) ^1\G,N) = l-Pr{M{i,u) = 1\G,N) 

Value of the parameter a: For a given graph G, initiator matrix N and observation matrix M, the 
correct value of the parameter a is easily inferred to be the one that maximizes the probability of the 
data M given G and N. 



3 Sampling graphs and initiators 
3.1 Basics 

Our input is the observations matrix M of size n x m, and the goal is to sample graphs and initiators 
from the posterior probability distribution Pr(G, A''|M). Note that we are interested in directed graphs 
and therefore every time we refer to a graph we will imply that it is directed, unless specified otherwise. 
Using Bayes rule we get that Pr(G, N\M) is equivalent to the following expression: 

oc Pr(M|G,iV)Pr(G, A^) (3) 
= Pr(M|G, A)Pr(G)Pr(A). (4) 

Step (2) in the above derivation is due to Bayes rule, Step (3) is due to the fact that Pr(M) is constant 
and finally. Step 4 comes from the assumption that G and N are independent. Alternatively, we could 
assume that the position of a node in a graph affects it as being an initiator or not; in this case instead 
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of Pt{G.N) = Pr(G)Pr(AO wc would have Pr(G, A^) = Pr(G)Pr(Af Despite this being a reasonable 
assumption, for simplicity we assume independence in this paper. 

Therefore, given a constant normalizing factor Z, our goal is to actually sample from the distribution 

7r(G,A^|M) = ^Pr(M|G,Ar)Pr(Af)Pr(G). (5) 

The term Pr(M|G, was already discussed in the previous section. The other two terms of Equation (5), 
Pr(G) and Pr(A'^) encode our prior beliefs about the form of the graph and the initiator matrix. In 
principle we want both G and N to be as condensed as possible. That is, we want to penalize for 
graphs with large number of edges, and for initiator matrices with lots of initiators initiating the same 
product. We encode these beliefs by penalizing for any additional 1-entry in G and in N . Thus we set, 
Pr(G) oc e""^^'^' and Pr(A^) oc e"'^^'''^'. Parameters c\ and C2 are constants, \N\ is the number of 1-entries 
in matrix A'' and \E\\^ the number of edges in G. 

3.2 MCMC algorithm 

In this section we show how to sample graph-initiator pairs from the posterior probability distribution 
TT. We start by describing a naive way of sampling graph-initiator pairs uniformly at random by means 
of a random walk on a Markov chain C. The state space of C are all the possible graph-initiator pairs. 
Then we show how to modify this chain to allow for getting samples from the right posterior probability 

TT. 

Consider the following naive algorithm for sampling the state space of graph- initiator pairs. Start 
with a graph- initiator pair {Go,No) and perform a local moves to other graph-initiator pairs. When 
many local moves have been performed, consider the resulting pair (G, N) in the sample. We call this 
algorithm the Naive algorithm, and in order to fully specify it we only need to additionally define what 
we mean by "local moves" . 

A local move from graph-initiator pair {G,N) to the pair {G',N') can be defined by a cell 
{i 7^ j) in G or a cell {k,£) in A''. The new pair {G',N') is formed by updating G and A^ so that 
G'{i,j) = (= 1) if G{i,j) = 1 (= 0), and N'{k,£) = 1 (= 0) if N{k,£) = (= 1). More specifically, 
given (G, A'') a LocalMove routine first uniformly picks G or N to modify and then performs one of the 
changes described above. 

Formally, a local move is a step on a Markov chain C = {S, T}, where the state space S consists of 
the set of all pairs of graphs and initiator matrices; T is the set of transitions defined by local moves. 
In other words, the set T contains all pairs (^{G,N), {G',N')^ such that it is possible to obtain {G',N') 
from (G, A^) (or vice versa) via local moves. 

Notice that the Markov chain C (a) is connected and (b) all the states in <S have the same degree. 
Therefore chain C converges to a uniform stationary distribution. As a result the Naive sampling 
algorithm also samples graph-initiator pairs uniformly at random. 

However, we want to sample graphs and initiators from the posterior probability distribution tt 
(Equation (5)). This is done by using the Metropolis-Hastings algorithm. In every state {G,N) the 
transition to state {G',N') suggested by a random local move is accepted with probability 

mm i 1, — -— ; — — \. 

^ ' 7r(G,A^|Af) ' 

Algorithm 1 gives the pseudocode for the Metropolis-Hastings algorithm. 

Quantities to monitor: As noted above, our goal is not to find a single pair (G, A'') that would have 

a high likelihood. Rather, we want to estimate the posterior probabilities that a link exists between 
two nodes, or that a particular entry is an initiator. To do this, we collect (after a sufficient burn-in 
period) statistics on how many times each entry in the matrices G and N is present. We denote the 
average values of G by G and the averages of the initiator matrix by A''. We report these quantities in 
our experiments. 

Starting state and running times: In principle a run of the Metropolis-Hastings algorithm can 
start from any arbitrary state and converge to the desired stationary probability distribution. However, 
this may take exponentially many steps. In our experiments we use as a starting state the graph-initiator 
pair (Go, A'^o) where Nq = M and Go is the empty graph. 
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Algorithm 1 Metropolis-Hastings 

1: Input: An initial state {Go,Nq) and a number of random walk steps s 

2: Output: A pair of graph and initiator (G, N) 

3: {G,N) ^ (G",iV') 

4; while s > do 

5: {G', N') ^LocalMove(G, N) 

6: (G, N) ^ {G' , N') with probability min { 1 , ^jg^ff } 

7: S <— S — 1 

8: end while 

9: return (G, N) 



Given the current state (G,N), every iteration of the Metropolis-Hastings algorithm requires a 
computation of the posterior probability distribution for the proposed state {G',N'). If the proposed 
state (G', N') is such that G' = G then the computation of the posterior probability tt requires 0{n^m) 
time. If on the other hand G' ^ G and TV' — N, then the computation of the posterior probability requires 
additionally a computation of all-pairs-shortest paths on the newly formed graph G' and therefore it 
requires 0{n^m + v? logn) time. 

4 Incorporating temporal information 

Instead of a single observation matrix M, assume an ordered (by time) sequence of observation matrices 
Ml, .... Mt- That is, Mi corresponds to the observation matrix at some timestamp ?'. The assumption is 
that the set of 1-entries in matrix Mt is a superset of the 1-entries of matrix Mt-\ for every t G {1, . . . , T}, 
i.e., for every 1 < t < T we have Mt C Mj+i. For example, once a signal is observed at some entity, 
then this is a recorded fact and cannot be cancelled. 

As before, given observation matrices Mi, . . . ,Mt our goal is to sample graph-initiator pairs from 
the posterior distribution of graphs and initiators given the observation matrices. Note that in this 
case every time instance t is associated with its own initiator matrix Nt. Therefore, in this case we arc 
dealing with graph-initiator pairs but rather with pairs of graphs and sequences of initiators (G, iV), 
where N = {Ni, . . . , Nt}- If Nt{i, u) = 1, then it means that signal u has been initiated by i at time t. 
We use TV to denote the sequence of initiators A^i^ . • , Nt and M to denote the sequence of observation 
matrices Mi, . . . , Mt- Moreover, we use Nt and Mt to denote the seguence^ the^first elements of the 
initiators and observations sequences respectively. Therefore Nt = N and Mt = M. 

Using again Bayes rule we get that the posterior probability ttt from which we want to sample is 

TrT{G,N\M) = Pr(G,iV|M) 

Pv{M\G,N)Pv{G,N) 
Pr(M) 
oc Pr(M|G,JV)Pr(G,7V) 

= Pr(M|G,7V)Pr(G)Pr(7V). (6) 

We have 

Pr(Ar) = Pv{N^,...,Nt) 

= Pr(TVi) X ... X Pv{Nt). 

Using now the definition of conditional probability, the probability Pr(M|G, N) can be easily written 
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as follows: 

Pt{Mt\G,Nt) 

= Pj:{MT\G,MT-i,NT)xFr{MT-i\G,NT) (7) 
= Pj:{Mt\G,Mt-i,Nt)xPv{Mt-i\G,Nt-i) (8) 

= Pt{Mt\G,Mt-i,Nt) X Pr{MT-i\G,MT-2,NT-i) 
X ... X Pr(Mi|G,7Vi). 

In the above equation, line (7) is by the definition of conditional probability. In line (8) we substitute 
Pv{Mt\G, Mt-i,Nt) with Pv{Mt\G, Mt-i, Nt), because of the assumption that Mt C Mf+i. Moreover 
wo substitute Pr(MT_i|G. Nt) with Pr(MT-i\G, Nt-i) since instances Mi, . . ., Mt-i can only depend 
on initiators A^i, . . ., Nt-i and cannot depend on the initiators of time T. 
As before, for an observation matrix Mt we have that 

Pi{Mtii,u) = 0\G,Mt-i,Nt) = 

n 

= [] (l - Mt_i(i, u)b{3, i, G) - Nk{j, u)b{j, i, G) 

3=1 

+Mt-i{j,u)Nk{j,u)b{j,i,G)^ 

n 

= Jl{l-{Mt_i{j,u) + Nk{j,u) 

3=1 

That is, the probability that Mt{i, u) = is equal to the the probability that i does not get u from any 
of the nodes that had it at some previous point in time neither did it get it from any of the nodes that 
initiated u at time t. Naturally, the probability that Mt{i,u) = 1 is 

PT{Mt{i,u) = l\G,Mt-uNt-i) = 

= l-PT{Mt{i,u) = 0\G,Mt_uNt_i). 

The Metropolis-Hastings algorithm described in the previous section can be applied for the sam- 
pling of pairs of graphs and initiator sequences {G,N) from the posterior probability distribution tt'. 
There are three differences though. First, the state space of the Markov chain will consist of the pairs of 
all possible graphs and initiator sequences. A local move from state (G, N) to (G'N') would randomly 
pick between changing an entry in G and flip it from 1 to (or from to 1) or one entry (fc,£) 

in each one of the initiator matrices in the sequence N. Finally, given current state (G, N) a proposed 
state {G',N') is accepted with probability min {l, 7r(G', 7V'|M)/ Tr{G,N\M)}. 

Quantities to monitor: As in the case of a single observation matrix here we also collect (after 
a sufficient burn-in period) statistics on how many times each entry in the matrices G and Nf (for 
1 < t < T) is present. We denote the average values of G by G and the averages of the initiator matrix 
at time thy Nt. We are also interested in the overall average initator matrix TVaii which is the average over 
all initiator matrices Nt- That is, A^aii = ^/TJ2t^i Nt- We report these quantities in our experiments. 
Starting state and running times: Obviously the state space of the Markov chain constructed for 
this extended version of the problem is huge. Therefore, despite it being connected, we cannot hope 
that the sampling algorithm will converge in polynomial time. We start the Markov chain from the state 
(Go, N°), where Go is the empty graph and for every Nt e it holds that Nt = Mt\ A^t-i- 

Given the current state (G, N), every iteration of the Metropolis-Hastings for the case of multiple 
instances posterior probability for the proposed state {G\N'). If the proposed state (G',N') is such 
that G' = G then the computation of the posterior probability tt' requires 0{Tn^m) time. If on the 
other hand G' ^ G and N' = N, then the computation of the posterior probability requires addition- 
ally a computation of all-pairs-shortest paths on the newly formed graph G' and therefore it requires 
0(T(n^m -I- log n)) time. 
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Figure 1: The original Rocky Mountain dataset consisting of 28 sites and 26 species. 



5 Experiments 

In this section wc show some indicative experimental results we obtained by applying our methods to eco- 
logical and paleontological datasets. The 0-1 matrices in these domains represent the presence/absence 
of species to sites. For each such presence/absence matrix we experiment with, the rows correspond to 
the sites and the columns correspond to the species. Our goal when analyzing these datasets is to find 
connection between sites with respect to, say, the migration of species from one site to another. For 
example, the existence of an edge between site i and site j, with direction from i to j would mean that 
there is high confidence that species appeared first to site i and migrated to site j. Initiators are also 
important; if a site is estimated to be an initiator for a specific species, then this is the site in which the 
species first appeared and then it migrated elsewhere. 

The goal of OTir experiments is to illustrate that our method can provide useful insights in the analysis 
of such datasets, that cannot be obtained by other standard data-analysis methods. 

For all our experiments we set ci = 2 and C2 = 9, since we believe that the cost of an initiator is 
larger than the cost of a single connection in a graph. For the Metropolis-Hastings algorithm, we split 
the sampling process into two stages: the burn-in period and the sampling period. All the experiments 
are run for 2 x 10^ steps. We use the first 10^ steps as the burn-in period and the rest 10^ for actual 
sampling. Every 10^ samples obtained during the actual sampling steps we report the instance of the 
average graph constructed by these 10** samples. In that way, we have 10 different estimates of the 
average graph, Gi, . . . , Giq ordered based on the iteration at which they were obtained. Similarly we get 
ten different estimates of the average initiator matrix A'^i, . . . , A''io. We refer to the last sample Giq (or 
iVio) as the average graph (or the average initiator matrix) and we denote them by G {N respectively). 

For a given dataset, we run the sampling algorithm for values of a in {0.1, . . . , 0.9} and we pick the 
one for which the sampled graph-initiator pairs converge to the largest value for the posterior probability. 

5.1 Experiments with ecological data 

The ecological datasets used for the experiments are available by AICS Research Inc, University Park, 
New Mexico, and The Field Museum, Chicago. The datasets are available online* and they have been 
used for a wide range of ecological studies [3, 6, 18]. 

We focus our attention on the results we obtained by applying our method to a single such dataset; 
the Rocky Mountain dataset. The dataset shows the presence/absence of Boreal and boreo-cordilleran 
species of mammals in the Southern Rocky Mountains, and has been used as a dataset of reference for 
many ecological studies, see for example [3]. The dataset itself is rendered in Figure 1.^ 

Now consider this dataset and let the data analysis task be the inference of directed relationships 
between sites with respect to migration of species between them. Lets first consider dealing with this 
task by using an off-the-shelf clustering algorithm. The idea would be to first cluster the rows of the 
dataset into, say, k groups, and then infer very strong connections among sites in the same group and 

^http : //www. aics-research. com/nestedness/ 

^Throughout we use black to plot the 1-entries of a 0—1 matrix and white to plot the 0-entries. 
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(a) links inferred by clustering 



(b) links inferred by similarities 



Figure 2: Rocky Mountain data - Figure 2(a): connection between sites by strongly connecting sites 
in the same cluster and not connecting sites in different clusters. Figure 2(b): pairwise similarity values 
between sites. 



weaker (or no) connections among sites in different groups. The clustering of the sites of the Rocky 
Mountain dataset into k = 2 clusters and an assignment of weight 1 to the connections between sites in 
the same group, and weight to the connections between the sites results in inferring the (crude) graph 
structure appearing in Figure 2(a). For the clustering we used standard k-mcans algorithm and measured 
the distance between points using the Hamming distance. Notice in the plot the ordering of the sites is 
the same as their ordering in the original dataset in Figure 1. Therefore, the clustering algorithm infers 
strong connections between the species-rich sites and strong connections between the species-poor sites, 
and no connections between them. 

An alternative way of inferring connections between sites would be by defining a similarity function 
between sites and then conclude that high similarity values between two sites implies strong migration 
connection between them. The inferred connections between sites when using as similarity the inverse 
of the Hamming distance between the sites is shown in Figure 2(b). 

Although the pairwise connections between sites extracted by using clustering or similarity computa- 
tions can give some insight on the connections between sites, it is obvious that the connection strengths 
inferred by these computations are rather crude. Moreover, neither of these two methods can give us 
some insight as of which sites have acted as the initiator for certain species. 

Figure 3 shows the results we obtained when analysing the same dataset using our method; in 
Figures 3(a) and 3(c) we plot the average graph G and the average initiator matrix ]Sf obtained by 
the Metropolis-Hastings algorithm for the Rocky Mountain dataset. The parameter a was set to 
a = 0.9, since this was the value of a that was steadily converging to graph-initiator pairs having the 
highest posterior probability tt. Again, the ordering of the sites in the rows of N and in the rows and 
columns of G retain their original ordering in the dataset shown in Figure 1. Recall that the G's are 
directed and thus the corresponding adjacency matrices are not expected to be symmetric. The elements 
in the main diagonal of the adjacency matrix of G are all I's since they represent connections of nodes 
with themselves. 

The average graph-initiator pair (G, N) is rather surprising. The average graph G, reveals much 
different set of connections between sites than those inferred by doing simple clustering or similarity 
computations. In fact, G reveals that for this value of a, there are only local interactions between the 
sites. That is, the interactions between sites with similar concentration of species is stronger than the 
interactions between sites with different species concentration. 

In terms of species migration Figure 3(a) suggests that species migrate from sites with smaller number 
of species to sites with larger number of species. Given the structure of the dataset, this also suggests 
that every site acts as^ an initiator for the species that none of its subsets contains. This is verified by 
the rendering of the N, average matrix of initiators shown in Figure 3(c). 

In order to test the reliability of the obtained results, we also plot the correlation coefficients between 
the estimations of G and N at different steps of the sampling phase of the Metropolis-Hastings 
algorithm. Figures 3(b) and 3(d) show the pairwise correlation coefficients between the average graphs 
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(a) G 



(b) cor. coef. of Gj's 
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(d) cor. coef. of iVj's 



Figure 3: Rocky Mountain data - Metropolis-Hastings results for a = 0.9. Figure 3(a): average 

graph G. Figure 3(b): pairwise correlation coefficients between the average graphs Gi, . . . , Gio obtained 

every 10^ samples in the the sampling stage of the Metropolis-Hastings algorithm. Figure 3(c): and 

average initiator matrix N . Figure 3(d): pairwise correlation coefficients between the average initiator 

matrices TVi, . . . , N\q obtained every 10^ samples in the the sampling stage of the Metropolis-Hastings 
algorithm. 
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(a) G (b) Nm 

Figure 4: Rocky Mountain data with T = 3 artificially generated instances Mi, M2, M3 - M3 corre- 
sponds to the original data. Results of the Metropolis-Hastings algorithm for a = 0.9. Figure 4(a): 
average graph G. Figure 4(b): average initiator matrix A^aii- 




Figure 5: The original paleontological dataset consisting of 60 sites and 70 species. 



Gj's and the average initiator matrices Ni's. These average graphs and initiators are obtained during the 
sampling phase of the Metropolis-Hastings algorithm in a fashion described in the beginning of the 
experimental section. Note that all instances of G^'s and iVi's are highly correlated with each other since 
the lowest value of the correlation coefficient observed is 0.86 and 0.93 respectively, while most of the 
values are very close to 1. This verifies that our sampling algorithm converges to the desired posterior 
probability distribution. 

Naturally, one can ask the question of how different these results would be if instead of a single 
observation matrix M, we had several instances Mi, . . . , Mt that would show how the distribution of 
species to sites evolves over time. Given that such data is not available for the Rocky Mountain 
dataset, we artificially generate it as follows: we set Mt = M and for f = 1, . . . ,r — 1 we generate 
instance Mt by converting every 1-entry in Mt+i to with probability 1/T. In that way Mt C Mt+i as 
we assumed in Section 4. In Figure 4 we show the average graph G and the average initiator matrix TVaii 
we obtained using T = 3 instances of the Rocky Mountain dataset generated as described above. The 
value of a was again set to 0.9. The average graph G shown in Figure 4(a) shows an obvious similarity 
with the average graph G shown in Figure 3(a); in both graphs the relatively strong links appear to start 
from sites with small species concentration towards sites with larger species concentration. The average 
initiator matrix A^aii in Figure 4(b) is much denser than the corresponding N obtained from the single 
instance dataset in Figure 3(c). However, in this case as well it is obvious there; there is a band of high 
values across the diagonal of the matrix that spans the matrix from southwest to northeast. 
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(a) links inferred by clustering (b) links inferred by similarities 

Figure 6: Paleontological data - Figure 6(a): connection between sites by strongly connecting sites in 
the same cluster and not connecting sites in different clusters. Figure 6(b): pairwise similarity values 
between sites. 



5.2 Experiments with paleontological data 

Wc present here a similar set of experiments for a paleontological dataset. The dataset contains informa- 
tion of fossil genera found in palaeontological sites in Europe [9].'^ The original dataset shown in Figure 5 
exhibits some kind of block structure, since by visual inspection one can identify to distinct blocks in the 
rendered matrix. Figures 6(a) and 6(b) show the graph structures among sites that arc inferred using 
A;-means clustering (k = 2) or similarity computations among the rows of the input matrix. As before, 
the estimated graphs appear to be rather crude, and give no information about the initiators of species. 

Figures 7(a) and 7(c) show the G and N estimated by the Metropolis-Hastings sampling algorithm. 
For this experiment we used the same burn-in and sampling scheme as the one described in the beginning 
of the experimental section. For this dataset we show the results for a = 0.4, since this was the value of 
a for which the Metropolis-Hastings algorithm steadily converged to samples of graph-initiator pairs 
with the highest likelihood tt. Both the estimated G and N show that the connections between the sites 
in these case are not so strong; matrix G contains lots of 0-entries, while there are quite many initiators 
in N, when compared to the original matrix TV. This is partly due to the fact that there are not really 
sites in which very similar sets of species appear; thus for every edge introduced in the graph we have 
to pay for a considerable number of false transfers of I's. At the same time the optimal value of a in 
this case being 0.4, has as a result to encourage significant number of initiators; since for relatively small 
values of a transfers of I's are not likely to succeed. 

Figures 7(b) and 7(d) show the the correlation coefficients between all pairs of Gi's and A^j's obtained 
during the sampling period of the run of the Metropolis-Hastings algorithm. Notice that in this case 
the instances of of Gj's and JVj's are not as correlated as in the case of the Rocky Mountain dataset. 
However the still exhibit high correlation since the smallest values obtained in these matrices are 0.6 
and 0.86 respectively, while most of the values are very close to 1, verifying our assumption that the 
algorithm indeed converges reasonably fast. 

6 Extensions and connections to other problems 

Other information propagation models: In the previous sections we have described our approach 
for estimating graphs and initiators given an arbitrary 0-1 observation matrix M. We have studied this 
problem for a specific information propagation model, namely the shortest path propagation model (SP) 
(see Section 2). We have focused our discussion to SP because it is simple and intuitive. However, our 
framework for estimating graphs and initiators can be used with any information propagation model, 
like for example the models studied in [11, 17]. 

Consider an arbitrary (not necessarily deterministic) information propagation model V that given a 
graph G and an initiator matrix N produces a matrix Mp = V{G, N). That is, the I's from the initiating 

^From the August 31, 2007, version of the dataset wc selceted the 70 columns that correspond to the oldest genera with 
at least 10 occurrences and all rows (sites) with at least 10 genera present. 
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(a) G (b) corr. coef. of Gj 's 




(c) N (d) cor. coef. of JVj's 

Figure 7: Paleontological dataset - Metropolis-Hastings results for a = 0.4. Figure 7(a): average 

graph G. Figure 7(b): pairwise correlation coefficients between the average graphs Gi, . . . , Gio obtained 
every 10^ samples in the the sampling stage of the Metropolis-Hastings algorithm. Figure 7(c): and 

average initiator matrix N . Figure 7(d): pairwise correlation coefficients between the average initiator 

matrices A^i, . . . , A''io obtained every 10^ samples in the the sampling stage of the Metropolis-Hastings 
algorithm. 
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rows of N arc propagatc;(l to other rows via the edges of the graph G and according to the propagation 
rules determined by the propagation model V. In that way matrix Mp = V{G, N) is formed. 

Now assume that the problem that we considered before. Namely, given a 0-1 observation matrix M 
and propagation model V estimate the structure (existence and strength of connections) of a directed 
graph G and the initiators N. In other words, we again ask for a sample of graphs and initiators from 
the posterior probability distribution 

Fr{G,N\M) oc Pr(M|G, iV)Pr(G)Pr(iV). 

Thus the task is the same as before. The only term in the above formula that depends on the propagation 
model is the term Pr(M|G, N). However, since G and N are known, matrix Mp = V{G, N) is also known. 
Therefore for some constant c we can substitute Pr(M|G, A'') with 

Pr{M\G,N) oc e-^l^-^^''! -e-'=l^^-^(«'^)l. (9) 

Note that in the above equation \M — Mp\ refers to the sum of absolute values of the entries of matrix 
M — Mp. Therefore, Equation (9) penalizes for graphs and initiators that generate matrices Mp that 
are considerably different from the observed matrix M. Therefore, under any propagation model V, 
Equation (9) can be used to compute the probability of the observed matrix M given G and N. With 
this, we can deploy our sampling algorithms under any propagation model. 

Notice that in general, the information propagation models do not have to be deterministic and 
consequently neither docs the matrix Mp. For non-deterministic propagation models Mp — E^V (G, N) ] . 
That is, in these cases Mp is the expectation of the constructed matrix, where the expectation is taken 
over the random choices of the propagation model. 

Connections to other problems: In the preceding discussion we have assumed that only the observa- 
tion matrix M is given and the goal was to estimate the strengths of directed links between the rows of 
M, as well as the beliefs for certain rows being initiators for signals that correspond to certain c;olumns. 
Here, we discuss some optimization problems that arise in the case where apart from the observation 
matrix M, also the graph G is given as part of the input. The natural question to ask in this case is 
formalized in the following problem definition. 

Problem 1 (Minimum Initiation problem). Given an n x m 0-1 observation matrix M, graph G, 
propagation model V and integer k, find an initiator matrix N with with at most k 1-entries so that 
\M — V {G, N)\ is minimized. 

We can very easily observe, that this problem is already hard for many well-known information 

propagation models, like for example the linear threshold (LT) and the independent cascade (IC) model 
described in [11]. Here we are not giving a detailed description of these two propagation models, we 
refer to [11] for this. For the rest of the discussion we can treat them as a black-box, bearing in mind 
that they are non-deterministic. We state the hardness result of the Minimum Initiation problem in 

the observation below and we discuss it immediately after. 

Observation 1. For V being either the linear threshold (LT) or the independent cascade (IG) models, 
the Minimum Initiation problem is NP-hard. 

The above observation is an immediate consequence of [11]. A careful observation of the results 

in [11], show that the Minimum Initiation problem is NP-hard even for the case where the observation 
matrix M of size n x 1, all entries of M have value 1 and \M — V (G, N)\ is required to be 0. The 
following inapproximability result is an immediate consequence of this observation. 

Corollary 1. When V is either the linear threshold (LT) or the independent cascade (IC) models, there 
is no polynomial-time approximation algorithm A for the Minimum Initiation problem, such that the 
solution of A is within a factor a > 1 from the optimal solution to the Minimum Initiation problem, 
unless P=NP. 

7 Conclusions 

We have introduced a model for finding the links and initiators that could be responsible for the observed 
data matrix. We described a simple propagation model for generating data from links and initiators, and 
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gave an MCMC algorithm for sampling from the right posterior distribution of graphs and initiators. We 
also studied some decision versions of the problem and showed their connections to related work from 
social-network analysis. 

The empirical results show that the MCMC approach is - perhaps surprisingly able to find mean- 
ingful structure in this setting. While there is no unique answer, the runs converge to certain average 
values of the graph-initiator pairs. 

Obviously there arc lots of open questions in the approach. An interesting question is selection among 
the different propagation models. The MCMC method converges fairly slowly, and therefore it would be 
of interest to see whether the individual iterations could be made faster. Ecological interpretation of the 
results is among the themes that we are currently pursuing actively. 
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